Executive Summary

When examinees from two different subgroups have the same ability distribution (or are “matched” on
ability) but are not equally likely to answer a particular item correctly, the item is said to exhibit DIF
(differential item functioning; that is, the item functions differently in the two groups). When test data are
analyzed, a statistical measure of DIF is calculated for each item so that items with large values of DIF (i.e.,
items with a large difference in the probability of equal-ability examinees in the two groups answering correctly)
can be investigated to determine whether the item should be removed from the test and/or item pool (the group of
items from which new tests are assembled). The Mantel-Haenszel (MH) procedure, which is used at the Law
School Admission Council (LSAC), has become the most widely used procedure for measuring DIF and is
recognized as the testing industry standard. The behavior of the MH DIF parameter is well understood for items
on which no guessing occurs, but not for items on which guessing does occur, which is often the case with
multiple-choice items.

This research report presents a general formulation of the MH DIF parameter that is equally appropriate for 
items on which guessing occurs and for items on which no guessing occurs. The value for this parameter is 
calculated for numerous realistic conditions to explore its behavior in situations where DIF might occur with 
real data. Practitioners have assumed that the MH DIF parameter behaves similarly regardless of guessing 
behavior, but our results indicate that guessing can affect the parameter’s value for relatively difficult items. As 
a result, the MH DIF statistic should be used with caution until the apparent deficiencies of this procedure are 
better understood or corrected. 

Before items are tested empirically for DIF at LSAC, and even before they are pretested (administered to 
examinees for the first time), they are subjected to rigorous sensitivity reviews. Additionally, real data do not 
mimic simulated data exactly. Thus, the implications of this study on the routine operational task of identifying 
DIF at LSAC are still unknown, and may in fact be minimal. However, because some items on the Law School 
Admission Test (LSAT) are known to exhibit guessing behavior, the results certainly suggest that additional 
research is warranted. 



Abstract 

The Mantel-Haenszel (MH) differential item functioning (DIF) parameter for uniform DIF is well defined 
when item responses follow the two-parameter-logistic (2PL) item response function (IRF), but not when they 
follow the three-parameter-logistic (3PL) IRF, the model typically used with multiple-choice items. This
research report presents a general formulation of the MH DIF population parameter for any IRF and presents 
results for numerous 3PL uniform DIF conditions. The results indicate that for items of medium or high 
difficulty, the 2PL DIF parameter formulation can overestimate the 3PL DIF parameter and the MH DIF 
estimator may exhibit less than expected power to identify even substantial DIF in certain situations. 

Introduction 

Differential item functioning (DIF) is said to occur in an item when examinees of equal proficiency (on the 
construct measured by a test), but from separate populations, differ in their probability of answering the item 
correctly. Although a large number of statistical procedures have been developed to detect DIF in test data, the 
Mantel-Haenszel (MH) procedure (Holland & Thayer, 1988), used at the Law School Admission Council 
(LSAC), has become the most widely used methodology and is recognized as the testing industry standard. 
The behavior of the MH DIF estimator, $\hat{\Delta}$, with respect to a number of factors has been studied extensively in
simulation studies (see, for example, Allen & Donoghue, 1996; Donoghue, Holland, & Thayer, 1993; Roussos
& Stout, 1996; Shealy & Stout, 1993; and Uttaro & Millsap, 1994). However, it has not been determined how
well $\hat{\Delta}$ estimates its corresponding population parameter, $\Delta$, when a DIF item is modeled by a
three-parameter-logistic (3PL) function (Birnbaum, 1968). Although the behavior of $\Delta$ is well understood when responses
follow either a one- or two-parameter logistic function (1PL or 2PL; e.g., Donoghue et al., 1993), no general
formulation of $\Delta$ has been derived, although possible general formulas have been defined (e.g., Spray & Miller,
1992). This lack of knowledge about $\Delta$ in the case of 3PL items has limited the evaluation of the statistical bias
in $\hat{\Delta}$ and has also hindered the understanding of the observed effects of simulation study factors on $\hat{\Delta}$. In
particular, Allen and Donoghue (1996), Donoghue, Holland, and Thayer (1993), and Uttaro and Millsap (1994)
all reported that the difficulty level of a 3PL DIF item can have a sizable effect on the magnitude of $\hat{\Delta}$, but
none of these studies could adequately explain the cause of this effect. Type I error studies by Allen and
Donoghue (1996) and Roussos and Stout (1996) have indicated that a statistical bias is sometimes present in $\hat{\Delta}$,
and that this bias varies with the difficulty level of the 3PL item being tested for DIF. (Statistical bias can be
estimated in Type I error studies because the true amount of DIF is known to be zero.) Type I error bias alone,
however, does not fully explain the observed relationship between $\hat{\Delta}$ and difficulty level in simulated 3PL DIF
items. The purpose of this research report is to present a formulation of the population DIF parameter for the
MH DIF estimator that is appropriate for any IRF model, including the 3PL model, and to describe through a
systematic set of calculations the behavior of this DIF parameter with respect to a number of examinee and item
factors. In particular, it will be shown that the unexplained behavior of $\hat{\Delta}$ with respect to difficulty level
observed in past simulation studies can be explained, at least in part, by the behavior of the MH DIF population
parameter. Moreover, it will be shown that this behavior of $\Delta$ has important practical implications for the
detection of DIF in real data analyses.

Item Response Theory Terminology 

Item response theory (IRT) describes the relationship between the ability or proficiency, $\theta$, of examinees on a
construct and their probability, $P_i(\theta)$, of a correct response on an item $i$ that measures that construct. In this
research report, the following notation will be used:

$i$ = item number,

$j$ = examinee number,

$n$ = number of items on test,

$N$ = number of examinees,

$X_{ij}$ = random variable for the response of examinee $j$ to item $i$ ($X_{ij} = 1$ indicates a correct response
and $X_{ij} = 0$ indicates an incorrect response), and

$P(X_{ij} = 1 \mid \theta_j) = P_i(\theta_j)$ = probability of a correct response on item $i$ for an examinee $j$ having ability
$\theta_j$. The functional representation used for $P_i(\theta_j)$ is called the item response function (IRF).

Multiple-choice tests typically give an examinee four or five options from which to choose the correct 
response to each item. For such items, empirical observation has shown that as examinee ability decreases, the 
probability of a correct response does not decrease asymptotically to zero, but rather to a fairly substantial finite 
value, usually between 0.10 and 0.25. It is generally believed that this non-zero lower asymptote of the IRF for 
a multiple-choice item is due in part to examinees having a finite probability of guessing the correct answer to a 
multiple-choice item due to the item format. Because of this belief, the lower asymptote parameter of an IRF is
commonly referred to as the guessing (or pseudo-guessing) parameter. The most common parametric IRF that
is used to model and simulate examinee responses to multiple-choice items is the 3PL model of Birnbaum
(1968), which is given by

$$P(X_{ij} = 1 \mid \theta_j) = c_i + \frac{1 - c_i}{1 + e^{-1.7 a_i (\theta_j - b_i)}}, \tag{1}$$

where

$a_i$ = the discrimination parameter for item $i$,
$b_i$ = the difficulty level of item $i$, and
$c_i$ = the lower asymptote for item $i$.

When $c_i = 0$, the IRF is referred to as the two-parameter logistic (2PL) IRF. When $c_i = 0$ and $a_i = 1$ (or is
constant across items), the IRF is referred to as the one-parameter logistic (1PL) IRF. The 1PL and 2PL models
are often inadequate for modeling responses to multiple-choice items; thus the 3PL model is commonly used.
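
As an illustration only, Equation 1 can be written as a short function. This is a minimal sketch rather than the authors' software; the function name `irf_3pl` and the example parameter values are assumptions chosen for demonstration.

```python
import math

D = 1.7  # scaling constant that appears in the exponent of Equation 1


def irf_3pl(theta, a, b, c=0.0):
    """Probability of a correct response under the 3PL model (Equation 1).

    theta: examinee ability; a: discrimination; b: difficulty;
    c: lower asymptote. c = 0 gives the 2PL model, and c = 0 with
    a = 1 gives the 1PL model.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# Hypothetical item with a = 1, b = 0, c = 0.2
p_low = irf_3pl(-6.0, a=1.0, b=0.0, c=0.2)  # approaches the lower asymptote c
p_mid = irf_3pl(0.0, a=1.0, b=0.0, c=0.2)   # equals c + (1 - c) / 2 when theta == b
```

With a nonzero `c`, the probability for very low abilities levels off near `c` instead of zero, which is the guessing behavior described above.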

DIF Terminology 

As stated above, DIF is defined as occurring in an item when examinees of equal proficiency, but from 
separate populations, differ in their probability of answering the item correctly. The item that is being tested for 
DIF is commonly referred to as the studied item. The populations of interest for DIF analyses at LSAC are 
based on ethnicity, gender, and geography (United States and Canada). The populations are categorized into a 
reference group population (Caucasians, males, or U.S. citizens) and a set of focal group populations (various 
minority groups, females, or Canadians). For didactic purposes we will limit our discussion to tests that are
intended to measure a unidimensional construct, which is the situation in which the MH statistic is intended to be
used. The proficiency of an examinee on the unidimensional construct will be referred to as $\theta$.

Using the above DIF terminology, a studied item is said to display DIF when reference group examinees and
focal group examinees matched on $\theta$ do not have the same probability of a correct response on the item. The
most common procedure used for modeling and simulating DIF in an item is to use a different 3PL IRF for the
reference group than for the focal group. The reference group IRF is denoted by $P_R(\theta)$ and the focal group IRF
by $P_F(\theta)$. When the only difference between the reference and focal group IRFs is in the difficulty parameter
($b_i$) of the studied item, the resulting DIF is referred to as uniform DIF because such DIF is graphically
represented as a uniform horizontal shift in the IRF for one group relative to the other (see Figure 1). All other
forms of DIF are referred to as nonuniform DIF.

FIGURE 1. Three-parameter-logistic item response functions for an
item that exhibits uniform DIF against the focal group. For both
groups, a = 1 and c = .2. In the reference group, b = -.125, and in
the focal group, b = +.125.

Let DIF($\theta$) be defined as the magnitude of DIF in a studied item at a particular value of $\theta$. A variety of
formulations of DIF($\theta$) exist in terms of $P_R(\theta)$ and $P_F(\theta)$. One common formulation parametrizes DIF($\theta$) as the
ratio of the odds of a correct response on the studied item for the reference group, $P_R(\theta)/Q_R(\theta)$, to the odds of a
correct response in the focal group, $P_F(\theta)/Q_F(\theta)$. If there is no DIF in the studied item [i.e., if $P_R(\theta) = P_F(\theta)$],
then the odds ratio, denoted by $\alpha(\theta)$, is equal to 1. In the case of uniform DIF within the framework of IRT,
when the studied item is 1PL or 2PL, $\alpha(\theta)$ can be shown through simple algebraic manipulation to be a constant
across $\theta$ given by

$$\alpha(\theta) = \frac{P_R(\theta)/Q_R(\theta)}{P_F(\theta)/Q_F(\theta)} = \frac{P_R(\theta)\,Q_F(\theta)}{P_F(\theta)\,Q_R(\theta)} = e^{-1.7a(b_R - b_F)}, \tag{2}$$

where $Q = 1 - P$ and $b_R$ and $b_F$ are the difficulty parameters of the studied item for the reference and focal
groups, respectively. Thus, $\ln[\alpha(\theta)]$ is equal to $-1.7a(b_R - b_F)$ when items follow the 1PL or 2PL model. Hence,
in the case of 1PL, the log-odds-ratio at any $\theta$ is simply $-1.7$ times the difference in difficulty parameters for the
two groups. In the case of 3PL uniform DIF, the odds ratio is not a constant across $\theta$ and is given by

$$\alpha(\theta) = \left[\frac{1 + c\,e^{-1.7a(\theta - b_R)}}{1 + c\,e^{-1.7a(\theta - b_F)}}\right] e^{-1.7a(b_R - b_F)}. \tag{3}$$

The behavior of the odds ratio, $\alpha(\theta)$, is important because, as will be discussed in more detail below, the MH
DIF statistic is based on the estimation of similar odds ratios. Thus, it will be helpful to refer back to the
previous equations when we review the MH DIF statistic and when we derive the equation for the MH
DIF parameter.
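
To make the contrast between the 2PL and 3PL odds ratios concrete, the following minimal sketch evaluates $\alpha(\theta)$ directly from its definition as a ratio of odds, using the Figure 1 parameters (a = 1, c = .2, reference b = -.125, focal b = +.125). The function names are assumptions introduced for illustration and are not part of the original report.

```python
import math


def irf_3pl(theta, a, b, c=0.0):
    """3PL item response function; c = 0 gives the 2PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


def odds_ratio(theta, a, b_ref, b_foc, c=0.0):
    """alpha(theta): reference-group odds divided by focal-group odds."""
    p_r = irf_3pl(theta, a, b_ref, c)
    p_f = irf_3pl(theta, a, b_foc, c)
    return (p_r / (1.0 - p_r)) / (p_f / (1.0 - p_f))


thetas = [-3.0, -1.0, 0.0, 1.0, 3.0]

# 2PL uniform DIF: the odds ratio is the same at every theta
alpha_2pl = [odds_ratio(t, 1.0, -0.125, 0.125, c=0.0) for t in thetas]

# 3PL uniform DIF: the odds ratio changes with theta
alpha_3pl = [odds_ratio(t, 1.0, -0.125, 0.125, c=0.2) for t in thetas]

# Constant predicted for the 2PL case: exp(-1.7 * a * (b_R - b_F))
constant_2pl = math.exp(-1.7 * 1.0 * (-0.125 - 0.125))
```

In this configuration the 3PL odds ratio is close to 1 at low abilities, where guessing dominates for both groups, and rises toward the 2PL constant as $\theta$ increases.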

The Mantel-Haenszel DIF Statistic

To evaluate whether a studied item displays DIF, examinees are first separated into reference and focal
groups (for example, males and females). Next the reference and focal groups are matched on ability, ideally $\theta$.
Because $\theta$ is an unobservable variable, DIF statistics must approximate matching on $\theta$ with a matching based on
the observable data at hand. The MH DIF statistic matches examinees on the basis of total test score, including
the score on the studied item. Holland and Thayer (1988) have shown that when all items on a test follow the
1PL model (in which case the total test score is a sufficient statistic for $\theta$), matching on total test score is the best
approximation to matching on $\theta$. When all the items on the test follow the 2PL or 3PL models, matching on total
test score will be asymptotically equivalent to matching on $\theta$ (Stout, 1990), which suggests that for sufficiently
long tests matching on total score will closely approximate matching on $\theta$. Simulation studies (such as Allen &
Donoghue, 1996; Donoghue, Holland, & Thayer, 1993; Roussos & Stout, 1996; Shealy & Stout, 1993; and
Uttaro & Millsap, 1994) have demonstrated the efficacy of matching on total score for 3PL items, although
significant breakdowns can sometimes occur when the number of items on the test is small (for example, 25
or fewer items) and the reference and focal groups display large mean score differences (for example, one
standard deviation).

After reference and focal group examinees are matched on total test score, a 2 x 2 x S contingency table is
formed, where S is the number of different values of total test score. At each score level $s$ the data can be
arranged as a 2 x 2 table, as shown in Table 1.

TABLE 1
2 x 2 contingency table

                       Correct       Incorrect     Total
Reference group (R)    C_Rs          I_Rs          N_Rs
Focal group (F)        C_Fs          I_Fs          N_Fs
Total group            C_Total,s     I_Total,s     N_Total,s



$C_{Rs}$ indicates the number of reference group examinees at score level $s$ who answered the studied item
correctly. The other variables in Table 1 are analogously defined. If the item does not display DIF, the observed
odds of a correct response for the two groups in each 2 x 2 table should be approximately the same because
examinees in the two groups are roughly matched on ability. If the two groups do not have approximately the
same odds of a correct response, the item is said to be functioning differently in the two groups (i.e., displays
DIF). Thus the ratio of the reference group odds of a correct response on the studied item to the focal group
odds is one natural score-level DIF estimator that can be formed from the contingency table. This odds-ratio
estimator for score level $s$, $\hat{\alpha}_s$, is given by

$$\hat{\alpha}_s = \frac{C_{Rs}\,I_{Fs}}{C_{Fs}\,I_{Rs}}. \tag{4}$$




In the case of no DIF, $\hat{\alpha}_s$ should be approximately one for all $s$ because the true odds for the two groups,
when matched on ability, will be the same.

When $\hat{\alpha}_s$ is assumed to be estimating a constant across $s$, an average value of $\hat{\alpha}_s$ can be used as an overall
measure of the DIF in an item. The MH odds ratio (Mantel & Haenszel, 1959), $\hat{\alpha}$, is a weighted average of the $\hat{\alpha}_s$
and is given by

$$\hat{\alpha} = \frac{\displaystyle\sum_s \left(\frac{I_{Rs}\,C_{Fs}}{N_{\mathrm{Total},s}}\right)\hat{\alpha}_s}{\displaystyle\sum_s \left(\frac{I_{Rs}\,C_{Fs}}{N_{\mathrm{Total},s}}\right)}. \tag{5}$$

The MH odds ratio provides a weighted average (see footnote 1) that gives stable estimation of $\alpha$ when $\hat{\alpha}_s$ is estimating a
constant odds ratio across all score levels.

Holland and Thayer (1988) defined the MH DIF estimator ($\hat{\Delta}$) as

$$\hat{\Delta} = -2.35 \ln(\hat{\alpha}). \tag{6}$$

This transformation of $\hat{\alpha}$ places $\hat{\Delta}$ on the Educational Testing Service (ETS) “delta scale” in the case of
1PL. The delta scale is an inverse normal transformation of the percent correct to a linear scale with a mean of
13 and a standard deviation of 4 and is used as an index of item difficulty by ETS test development staff. The
$\hat{\Delta}$ statistic is interpreted as a difference in the difficulty of items for the reference and focal groups on the delta
scale (Zieky, 1993).
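
The computation in Equations 4 through 6 can be sketched as follows. This is a minimal illustration, not LSAC or ETS operational code: the function name `mh_delta` and the counts in `example_tables` are hypothetical. The sums use the algebraically equivalent form of Equation 5 in which each weight multiplied by its score-level odds ratio reduces to $C_{Rs} I_{Fs} / N_{\mathrm{Total},s}$.

```python
import math


def mh_delta(tables):
    """Return (alpha_hat, delta_hat) from score-level 2 x 2 tables.

    tables: iterable of (C_R, I_R, C_F, I_F) counts, one tuple per
    total-score level s (see Table 1).
    """
    numerator = 0.0
    denominator = 0.0
    for c_r, i_r, c_f, i_f in tables:
        n_total = c_r + i_r + c_f + i_f
        if n_total == 0:
            continue  # an empty score level contributes nothing
        # weight * alpha_hat_s = (I_R * C_F / N) * (C_R * I_F) / (C_F * I_R)
        numerator += c_r * i_f / n_total
        denominator += i_r * c_f / n_total
    alpha_hat = numerator / denominator
    delta_hat = -2.35 * math.log(alpha_hat)  # Equation 6
    return alpha_hat, delta_hat


# Hypothetical counts in which every score level has an odds ratio of 2
example_tables = [(20, 10, 10, 10), (30, 10, 15, 10)]
alpha_hat, delta_hat = mh_delta(example_tables)
```

Because every score level in this made-up example has the same odds ratio, the weighted average returns that common value regardless of the weights.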



The Mantel-Haenszel DIF Parameter 

The goal of this section is to derive an expression for a theoretical MH DIF parameter, $\Delta$, that represents the
expected value for $\hat{\Delta}$ that would be obtained for a studied item if examinees could be matched on $\theta$ exactly.
The examinees’ $\theta$’s will be assumed to be sampled from an infinitely large population with continuous $\theta$ density
functions for the reference and focal group populations and with a fixed ratio of reference group population size
to focal group population size. Such assumptions are typical of DIF simulation studies.

The general approach taken in this derivation is to assume that examinees are matched on a carefully chosen
theoretical matching test, then to let test length go to infinity in a controlled way so that examinees are
matched on $\theta$. The derivation begins by assuming examinees are matched based on their scores on a
theoretical matching test (rather than a real or simulated one) that sorts examinees by test scores exactly (i.e.,
with no error). Assuming a matching test that has perfect reliability allows the convenience of not including the
score on the studied item in the matching criterion. The derivation could just as well be carried out with the
score on the studied item included in the matching criterion and would, in the end, result in the same formula for
the MH DIF parameter. However, the derivation is less cumbersome when the score on the studied item is not
included in the matching criterion.

Consider a matching test consisting of $S$ Guttman items (Guttman, 1944) that span the ability range from
$-\sqrt{S}$ to $\sqrt{S}$ at equal intervals for both the reference and focal groups. Let the difficulty parameters of these
items be denoted by $\beta_j$ and assume that they are on the same scale as the $\theta$’s. The ordered difficulty parameters
will be denoted by $\beta_{(j)}$, where $\beta_{(j+1)} - \beta_{(j)} = \delta$ for all $j$ = 1 to $S$. Because the items are equally spaced over the
above stated ability range, we obtain $\delta = 2\sqrt{S}/S = 2/\sqrt{S}$. If an examinee obtains a score of $s$, this indicates that
the examinee’s $\theta$ was at least $\beta_{(s)}$, but less than $\beta_{(s)} + \delta$. Thus, the probability of an examinee with ability $\theta$
obtaining a score of exactly $s$, denoted by $P(s \mid \theta)$, is then given by

$$P(s \mid \theta) = \begin{cases} 1 & \text{if } \beta_{(s)} \le \theta < \beta_{(s)} + \delta, \\ 0 & \text{otherwise.} \end{cases} \tag{7}$$

Footnote 1. $\hat{\alpha}_s$ will be difficult to estimate when either $I_{Rs}$ is close to 0 or $C_{Fs}$ is close to 0. In such cases, $\hat{\alpha}_s$ will be very unstable and cause
large variability in the estimation of $\alpha$. Thus when estimating a constant odds ratio, it makes sense to use an estimator that gives
proportionally less weight to score cells according to how close the cell is to having $I_{Rs} = 0$ or $C_{Fs} = 0$. By using weights proportional to
$I_{Rs} C_{Fs}$, the Mantel-Haenszel odds ratio accomplishes this objective. Thus the weights in Equation 5 will be small for cells where $\hat{\alpha}_s$
will be unstable.

Now consider a specific cell $C_{Rs}$ in the 2 x 2 x S contingency table (see Table 1). The theoretical probability
of a particular examinee from the reference group, with ability $\theta$, contributing to this cell is given by

$$P(s \mid \theta)\,P_R(\theta), \tag{8}$$

where $P_R(\theta)$ is the reference group IRF for the studied item, as defined previously, which is not assumed to take
any particular functional form for either the reference group or focal group. By assuming that the reference
group $\theta$’s follow some underlying distribution $f_R(\theta)$, we can then determine the expected total cell count for $C_{Rs}$,
for a sample of $N_R$ reference group examinees, from the following integral:

$$E[C_{Rs}] = N_R \int_{-\infty}^{\infty} P(s \mid \theta)\,P_R(\theta)\,f_R(\theta)\,d\theta. \tag{9}$$

Similar equations can be developed for all the other cells in the 2 x 2 x S contingency table. Because the only
$\theta$’s that contribute to $C_{Rs}$ are between $\beta_{(s)}$ and $\beta_{(s)} + \delta$, and because $P(s \mid \theta)$ is unity in this range of $\theta$’s, Equation
9 can be written as

$$E[C_{Rs}] = N_R \int_{\beta_{(s)}}^{\beta_{(s)} + \delta} P_R(\theta)\,f_R(\theta)\,d\theta. \tag{10}$$

This expression may be simplified by invoking the mean-value theorem to replace the integral with
a product. According to the mean-value theorem, there exists some value $\theta'_{C_{Rs}}$ in the interval
$\beta_{(s)} < \theta'_{C_{Rs}} < \beta_{(s)} + \delta$ that, when evaluated in the integrand expression [$P_R(\theta) f_R(\theta)$, in this case], then
multiplied by the width of the interval ($\delta$, in this case), will result in the same value as the original integral.
Thus, by applying the mean-value theorem to Equation 10, $E[C_{Rs}]$ can be written as

$$E[C_{Rs}] = N_R\,P_R(\theta'_{C_{Rs}})\,f_R(\theta'_{C_{Rs}})\,\delta, \tag{11}$$

for some $\beta_{(s)} < \theta'_{C_{Rs}} < \beta_{(s)} + \delta$.

Similar expressions can be derived for $E[C_{Fs}]$, $E[I_{Rs}]$, and $E[I_{Fs}]$, with corresponding ability mean values
$\theta'_{C_{Fs}}$, $\theta'_{I_{Rs}}$, and $\theta'_{I_{Fs}}$. The expression for $E[N_{\mathrm{Total},s}]$, which is required for the MH weights, is given by

$$E[N_{\mathrm{Total},s}] = N_{\mathrm{Total}} \Big\{ \gamma_F \big[ P_F(\theta'_{C_{Fs}}) f_F(\theta'_{C_{Fs}}) + Q_F(\theta'_{I_{Fs}}) f_F(\theta'_{I_{Fs}}) \big] + \gamma_R \big[ P_R(\theta'_{C_{Rs}}) f_R(\theta'_{C_{Rs}}) + Q_R(\theta'_{I_{Rs}}) f_R(\theta'_{I_{Rs}}) \big] \Big\}\,\delta, \tag{12}$$

where $N_{\mathrm{Total}} = N_R + N_F$, the total number of examinees,

$\gamma_F$ is the proportion of the total number of examinees who are in the focal group, and
$\gamma_R$ is the proportion in the reference group.

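An equation for $\alpha$, which is analogous to the equation for $\hat{\alpha}$ (Equation 5), can then be specified by
substituting in these expected values and simplifying, giving

$$\alpha = \frac{\displaystyle\sum_s \frac{Q_R(\theta'_{I_{Rs}}) f_R(\theta'_{I_{Rs}})\,P_F(\theta'_{C_{Fs}}) f_F(\theta'_{C_{Fs}})\,\delta}{\gamma_F \big[ P_F(\theta'_{C_{Fs}}) f_F(\theta'_{C_{Fs}}) + Q_F(\theta'_{I_{Fs}}) f_F(\theta'_{I_{Fs}}) \big] + \gamma_R \big[ P_R(\theta'_{C_{Rs}}) f_R(\theta'_{C_{Rs}}) + Q_R(\theta'_{I_{Rs}}) f_R(\theta'_{I_{Rs}}) \big]}\;\alpha(s)}{\displaystyle\sum_s \frac{Q_R(\theta'_{I_{Rs}}) f_R(\theta'_{I_{Rs}})\,P_F(\theta'_{C_{Fs}}) f_F(\theta'_{C_{Fs}})\,\delta}{\gamma_F \big[ P_F(\theta'_{C_{Fs}}) f_F(\theta'_{C_{Fs}}) + Q_F(\theta'_{I_{Fs}}) f_F(\theta'_{I_{Fs}}) \big] + \gamma_R \big[ P_R(\theta'_{C_{Rs}}) f_R(\theta'_{C_{Rs}}) + Q_R(\theta'_{I_{Rs}}) f_R(\theta'_{I_{Rs}}) \big]}}, \tag{13}$$

where

$$\alpha(s) = \frac{P_R(\theta'_{C_{Rs}}) f_R(\theta'_{C_{Rs}})\,Q_F(\theta'_{I_{Fs}}) f_F(\theta'_{I_{Fs}})}{P_F(\theta'_{C_{Fs}}) f_F(\theta'_{C_{Fs}})\,Q_R(\theta'_{I_{Rs}}) f_R(\theta'_{I_{Rs}})}. \tag{14}$$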



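Finally, we let test length, $S$, become asymptotically large while maintaining the equally spaced Guttman
items and the difficulty parameter range of $-\sqrt{S}$ to $\sqrt{S}$. Hence, the $\theta$ range approaches $-\infty$ to $+\infty$, the width
of the intervals ($\delta$) approaches 0, and $\theta'_{C_{Rs}}$, $\theta'_{C_{Fs}}$, $\theta'_{I_{Rs}}$, and $\theta'_{I_{Fs}}$ all approach the same value, say $\theta_s$. Because
score $s$ approaches a continuous variable, $\theta_s$ may be replaced with $\theta$, and the summations over the $\delta$ intervals in
Equation 13 can then be replaced by integrals over $d\theta$. Because all the different $\theta'$ values for the same score $s$
approach a common value, the denominator of the MH weights simplifies to $\gamma_F f_F(\theta) + \gamma_R f_R(\theta)$. Thus, our final
formula for $\alpha$ [which applies weights to $\alpha(\theta)$ in a manner analogous to the way the MH odds ratio applies weights to
$\hat{\alpha}_s$] is given by

$$\alpha = \frac{\displaystyle\int_{-\infty}^{\infty} \alpha(\theta)\,\frac{Q_R(\theta)\,f_R(\theta)\,P_F(\theta)\,f_F(\theta)}{\gamma_F f_F(\theta) + \gamma_R f_R(\theta)}\,d\theta}{\displaystyle\int_{-\infty}^{\infty} \frac{Q_R(\theta)\,f_R(\theta)\,P_F(\theta)\,f_F(\theta)}{\gamma_F f_F(\theta) + \gamma_R f_R(\theta)}\,d\theta}. \tag{15}$$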



This formula is the same as the one postulated by Spray and Miller (1992).

Note that $\alpha(\theta)$ reduces to Equation 2 in the case of 1PL and 2PL, and it reduces to Equation 3 in the case of
3PL. The resulting value of $\alpha$ from Equation 15 is substituted into the final formula for the MH DIF parameter,

$$\Delta = -2.35 \ln(\alpha). \tag{16}$$

Although no closed-form solution exists for the general solution of the integral for $\alpha$ (Equation 15),
numerical integration can be used to compute $\alpha$. We developed software for the computation of $\Delta$, via $\alpha$ using
Equation 15, by employing a standard technique, Simpson’s 1/3 Rule (see, for example, Gerald, 1970), and
using 199 integration points across a $\theta$ range from -6 to 6. This software for calculating the MH DIF population
parameter is available upon request from the authors.
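
The following is a minimal sketch of how Equations 15 and 16 might be evaluated with composite Simpson's 1/3 rule on 199 equally spaced points over [-6, 6]. It is not the authors' software. It assumes 3PL uniform DIF, a standard normal reference group, and a unit-variance normal focal group (the conditions described later in the report); the function names and argument names are illustrative assumptions.

```python
import math


def irf_3pl(theta, a, b, c=0.0):
    """3PL item response function; c = 0 gives the 2PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


def normal_pdf(x, mean=0.0):
    """Density of a normal distribution with unit variance."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def mh_dif_parameter(a, c, b_ref, b_foc, focal_mean=0.0, ratio=1.0,
                     n_points=199, lo=-6.0, hi=6.0):
    """Approximate Delta (Equation 16) via Simpson's 1/3 rule.

    ratio is reference group size divided by focal group size.
    n_points must be odd so that the number of intervals is even.
    """
    gamma_r = ratio / (1.0 + ratio)
    gamma_f = 1.0 / (1.0 + ratio)
    h = (hi - lo) / (n_points - 1)
    num = 0.0
    den = 0.0
    for k in range(n_points):
        theta = lo + k * h
        # Simpson coefficients 1, 4, 2, 4, ..., 4, 1
        if k == 0 or k == n_points - 1:
            w = 1.0
        elif k % 2 == 1:
            w = 4.0
        else:
            w = 2.0
        p_r = irf_3pl(theta, a, b_ref, c)
        p_f = irf_3pl(theta, a, b_foc, c)
        f_r = normal_pdf(theta)
        f_f = normal_pdf(theta, focal_mean)
        mix = gamma_f * f_f + gamma_r * f_r
        # Numerator integrand: alpha(theta) times the MH weight, which
        # simplifies to P_R * Q_F * f_R * f_F / mix
        num += w * p_r * (1.0 - p_f) * f_r * f_f / mix
        # Denominator integrand: the MH weight Q_R * P_F * f_R * f_F / mix
        den += w * (1.0 - p_r) * p_f * f_r * f_f / mix
    # The common factor h / 3 cancels in the ratio
    return -2.35 * math.log(num / den)


delta_2pl = mh_dif_parameter(a=0.9, c=0.0, b_ref=1.0, b_foc=2.0,
                             focal_mean=-1.0, ratio=5.0)
delta_3pl = mh_dif_parameter(a=0.9, c=0.2, b_ref=1.0, b_foc=2.0,
                             focal_mean=-1.0, ratio=5.0)
```

In the 2PL call the result should agree with $-2.35 \times (-1.7a(b_R - b_F))$ up to rounding error, since $\alpha(\theta)$ is constant. In the 3PL call the result lies strictly between that value and zero, because $\alpha(\theta)$ in Equation 3 is everywhere between 1 and the 2PL constant when $b_R < b_F$.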

Recall that when the IRF of the studied item follows either the 2PL or 1PL model and DIF is uniform, $\alpha(\theta)$ is
a constant with respect to $\theta$, as given in Equation 2, and comes outside the integral in the numerator in Equation
15. The integrals in the numerator and denominator then cancel each other out, which results in

$$\alpha = e^{-1.7a(b_R - b_F)} \tag{17}$$

and, consequently,

$$\Delta = -2.35 \ln\!\left(e^{-1.7a(b_R - b_F)}\right) = -2.35\,\big(-1.7a(b_R - b_F)\big) = 4a(b_R - b_F), \tag{18}$$

which is the well-known form for $\Delta$ in the 1PL and 2PL cases (e.g., Donoghue, Holland, & Thayer, 1993). The
estimate of $\Delta$ is interpreted as an estimate of the difference in difficulty level for the reference and focal groups
on the studied item as measured on the ETS delta scale (Zieky, 1993). In the case of 1PL, this interpretation of
$\Delta$ is justified because $\Delta$ is $4(b_R - b_F)$, a difference in b’s multiplied by the appropriate constant. The
interpretation of $\Delta$ in the 2PL case is less direct because of the $a$ parameter in the $4a(b_R - b_F)$ formula.

In the 3PL case, the odds ratio parameter $\alpha(\theta)$ is not constant across $\theta$ (see Equation 3); thus the simple
$4a(b_R - b_F)$ rule does not apply. In this case, because $\alpha(\theta)$ is not constant across $\theta$ and because $\Delta$ does not have
a simple form, it is not clear what the interpretation of $\hat{\Delta}$ should be.

Because $\hat{\Delta}$ has become the industry standard (and the most widely used) DIF estimator, and it is used with
data that are modeled well by the 3PL IRF, it is important to know whether the interpretation of $\hat{\Delta}$ as an
estimator of difference in difficulty level holds approximately true when the studied item is modeled with 3PL
uniform DIF.



Computation of $\Delta$ Under Realistic Conditions

Our formulation of the MH DIF parameter was calculated for numerous uniform DIF conditions. The $\theta$’s for
the reference group were always specified as following a standard normal distribution. The $\theta$’s for the focal
group were specified as following a normal distribution with unit variance; the mean was set at -1, -.5, 0, .5, or
1 (five levels). The ratio of the reference group size to the focal group size (Ratio) was 5, 4, 3, 2, or 1 (five
levels). The discrimination parameter ($a$) was set at .5, .7, .9, 1.1, or 1.3 (five levels). The lower asymptote ($c$)
was set at 0 (the 2PL case), .1, .2, or .3 (four levels). The difficulty parameter for the reference group ($b_R$) was
set at -2, -1.5, -1, -.5, 0, .5, 1, 1.5, or 2 (nine levels). The difficulty parameter for the focal group ($b_F$) was set
equal to $b_R$ (i.e., no DIF) and to $b_R$ ± .1, .2, ..., 1.5 (31 levels). (That is, the difference in b values for the two
groups was 0, ±.1, ±.2, ..., ±1.5.) The item parameter factors were fully crossed, resulting in 5 x 4 x 9 x 31 =
5,580 “items” for which $\Delta$ was calculated (levels of $a$ by $c$ by $b_R$ by $b_F$) for the 5 x 5 = 25 examinee
combinations (focal group mean by ratio of reference to focal group size). A representative subset of the results
will be presented.
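
The fully crossed design described above can be enumerated in a few lines. This is only a sketch for checking the counts; the variable names are assumptions introduced here.

```python
from itertools import product

a_levels = [0.5, 0.7, 0.9, 1.1, 1.3]                         # discrimination
c_levels = [0.0, 0.1, 0.2, 0.3]                              # lower asymptote
b_ref_levels = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
b_offsets = [k / 10 for k in range(-15, 16)]                 # b_F - b_R: -1.5, ..., 1.5
focal_means = [-1.0, -0.5, 0.0, 0.5, 1.0]
ratios = [5, 4, 3, 2, 1]                                     # reference : focal size

# One tuple (a, c, b_R, b_F) per simulated "item"
items = [
    (a, c, b_ref, round(b_ref + offset, 1))
    for a, c, b_ref, offset in product(a_levels, c_levels, b_ref_levels, b_offsets)
]

# One tuple (focal mean, ratio) per examinee condition
examinee_conditions = list(product(focal_means, ratios))
```

Each element of `items` can then be evaluated under each element of `examinee_conditions`.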

ETS has developed a classification scheme to flag items with DIF (see Zieky, 1993), which is also used at
LSAC. Items where $|\hat{\Delta}|$ is at least 1.5 and significantly greater than 1.0 are classified as “C” items, or moderate
to large DIF. Items with a flag of C are routinely excluded from test assembly at LSAC. Items where $|\hat{\Delta}|$ is not
significantly greater than 0.0 or $|\hat{\Delta}|$ is less than 1.0 are classified as “A” items, or negligible DIF. All other
items are classified as “B” items, or slight to moderate DIF. Note that in the 1PL case, according to the equation
$\Delta = 4(b_R - b_F)$, a difference in b values of .375 or more would result in $|\Delta|$ values corresponding to a C flag
($|\Delta| \ge 1.5$). Thus 24 of the 31 differences in b values used in the present study represent substantial DIF levels
($b_R - b_F$ = ±.4, ±.5, ..., ±1.5). (Statistical significance is not relevant here because we are calculating the
parameter itself, not a statistic.)
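
When the scheme is applied to a parameter value rather than to a statistic, only the magnitude thresholds remain. The sketch below encodes that reduced rule; the function name is hypothetical, and the significance tests that are part of the operational ETS rule are deliberately omitted.

```python
def dif_category_from_parameter(delta):
    """Classify a population Delta using only the 1.0 and 1.5 cutoffs.

    The significance-test conditions of the operational rule are not
    represented here because a parameter has no sampling error.
    """
    size = abs(delta)
    if size >= 1.5:
        return "C"
    if size < 1.0:
        return "A"
    return "B"


# 1PL examples using Delta = 4 * (b_R - b_F)
category_at_threshold = dif_category_from_parameter(4 * 0.375)   # Delta = 1.5
category_moderate = dif_category_from_parameter(4 * -0.3)        # Delta = -1.2
category_small = dif_category_from_parameter(4 * 0.2)            # Delta = 0.8
```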

The panels in Figure 2 show how $c$ affects $\Delta$ at the various values of $b_R$ and $b_F$. For all items Ratio = 5,
a = 0.9, and focal group mean = -1.0. The values of $b_F$ are plotted along the horizontal axis, and $\Delta$ is plotted
along the vertical axis. The values of $b_R$ are indicated by the different symbols. For instance, all solid circles
represent $b_R$ = 2.0, and the position of a particular solid circle in reference to the horizontal axis indicates the
value of $b_F$. When there is no difference in difficulty for the two groups ($b_R = b_F$), $\Delta$ = 0 (no DIF) for all
levels of $b_R$, as expected. The horizontal lines in the figure at $\Delta$ values of ±1.5 and ±1.0 delineate the $\Delta$ cutoff values
for C and B DIF items, respectively.



[Figure 2 panels: A. c = 0 (2PL case); B. c = .1; C. c = .2; D. c = .3]

FIGURE 2. Variation of $\Delta$ with $b_R$ (b value of reference group, indicated by the symbol), $b_F$ (b value of
focal group, indicated along the horizontal axis), and c (which varies across panels) for fixed values
of a = .9, Ratio = 5, and focal group mean = -1.0.

For c = 0 (the 2PL case, shown in panel A of Figure 2), we find, as expected, that $\Delta = 4a(b_R - b_F)$. In the
1PL case (a = 1 and c = 0), $b_R - b_F$ = 0.4 would result in a C-level value of $\Delta$. However, for the items in panel
A of Figure 2 (a = .9), it is clear that when $a \ne 1$, $\Delta$ is not merely a transformation of $b_R - b_F$ onto the ETS
delta scale, as it is usually interpreted. This effect of $a$ on the interpretation of $\Delta$ in the 2PL case can be seen
more clearly if we compare what $\Delta$ would be for different values of $a$ for the same value of $b_R - b_F$. For
example, if $a$ were set equal to 1.3, a value of $b_R - b_F$ of about 0.29 yields a $\Delta$ of 1.5, whereas if $a$ were set
equal to 0.5, the same $b_R - b_F$ of 0.29 results in a $\Delta$ of only 0.58 in the 2PL case. Even though $\Delta$ = 1.5 (when a =
1.3) is interpreted as an indication of a larger difference in difficulty than $\Delta$ = 0.58 (when a = 0.5), the difficulty
differences are in actuality exactly the same ($b_R - b_F$ = 0.29). As interesting as these 2PL results are, they are,
for the most part, merely didactic because item responses on the Law School Admission Test (LSAT) and other
nationally administered standardized tests often follow the 3PL model due to the extensive use of
multiple-choice items on these tests. Therefore, we now turn our attention to the results for the 3PL model.
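
As a quick check of the arithmetic in the 2PL example, using only the values quoted in the text and the 2PL form $\Delta = 4a(b_R - b_F)$:

$$a = 1.3:\quad \Delta = 4(1.3)(0.29) = 1.508 \approx 1.5, \qquad a = 0.5:\quad \Delta = 4(0.5)(0.29) = 0.58.$$

The same difference in difficulty therefore falls at the C cutoff for the more discriminating item and within the A range for the less discriminating one.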

Panels B, C, and D of Figure 2 show that the introduction of positive values of $c$ into the IRF model has quite
a large effect on $\Delta$. Clearly, the 2PL formula, $\Delta = 4a(b_R - b_F)$, is not a very good general approximation to $\Delta$ in
the case of 3PL uniform DIF. Except for the very lowest values of $b_R$, $\Delta$ is moderately to substantially reduced
from its 2PL value as $b_R$ increases, especially when the item favors the reference group (i.e., negative $b_R - b_F$
values, and hence $\Delta$ values, indicating the item is more difficult for the focal group). Even for $b_R$ values as small
as 0, which is usually close to the mean of all the b values, the 3PL value of $\Delta$ is on average less than half the
corresponding 2PL value when averaged over the 15 $b_R - b_F$ values for DIF against the focal group for the case
of c = 0.2 (approximately the average c value on the LSAT). Even when such items display substantial amounts
of DIF against the focal group ($b_R - b_F$ = -1.0 to -1.5), $\Delta$ seldom reaches the level of a “B” item value, much
less a “C” item.

The panels in Figure 3 show how varying $a$ affects $\Delta$ at the various values of $b_R$ and $b_F$ in the 3PL case
(c = 0.2). For all items Ratio = 5 and focal group mean = -1.0. The primary pattern of $\Delta$ decreasing when $b_R$
increases, which was evident in Figure 2, is maintained across varying values of $a$, as shown in Figure 3.
Additionally, varying $a$ does have a large effect on $\Delta$, as expected from the 2PL results discussed above. As was
seen in Figure 2, the 3PL $\Delta$ values for the lowest values of $b_R$ followed the 2PL values [$\Delta = 4a(b_R - b_F)$]
more closely than for the higher values of $b_R$ for all levels of $a$. However, as $a$ increases, the distortion in $\Delta$
becomes more dramatic as $b_R$ increases. Thus, the effect of varying $a$ on $\Delta$ for low values of $b_R$ was similar to
that in the 2PL case: as $a$ increases, so does $\Delta$. However, for higher values of $b_R$, the effect of varying
$a$ was quite unexpectedly the opposite of what would be predicted by the 2PL formula: as $a$ increased,
$\Delta$ decreased, with the effect occurring most strongly for the case of DIF against the focal group. For example,
when $b_R$ = 1 and $b_F$ = 2, at a = 0.5, 0.7, 0.9, 1.1, and 1.3, the respective $\Delta$ values are -0.957, -0.783, -0.603,
-0.466, and -0.369.

FIGURE 3. Variation of Δ with bR (b value of reference group, indicated by the symbol), bF (b value of 
focal group, indicated along the horizontal axis), and a (which varies across panels) for fixed values 
of c = .2, Ratio = 5, and focal group mean = -1.0. 



The panels in Figure 4 show how varying the focal group mean affects Δ at the various values of bR and bF 
in the 3PL case (c = 0.2) for Ratio = 5 and a = 0.9. In all cases, the focal group θ’s were specified as following a 
normal distribution with unit standard deviation; the mean of the distribution took the values -1 (shown in Panel 
C of Figure 2), -.5, 0, .5, and 1 (shown in Panels A through D of Figure 4). Recall that the reference group’s θ’s 
were always specified as following a standard normal distribution. As in Figures 2 and 3, the most dramatic 
effect is that Δ decreases as bR increases. Figure 4 shows that varying the focal group mean has a noticeable 
effect on Δ, a result that the 2PL formula for Δ could not predict. The results show that the focal groups for 
which Δ is most reduced relative to its 2PL value are the groups with the lowest proficiency distributions (focal 
means of -1.0, -0.5, and 0.0; the focal group mean of -1.0 is shown in Panel C of Figure 2). Even though the 
results are better (i.e., Δ is distorted less compared to the 2PL formula) for the focal groups with means greater 
than 0, the shrinkage effect remains quite strong when bR > 1.0 (for the focal mean of 0.5) or when bR > 1.5 
(for the focal mean of 1.0). Moreover, the results for Figure 4 are only for a = 0.9, and Figure 3 showed that 
higher a values cause even larger distortions in Δ. The focal groups used in DIF analyses at LSAC typically 
have estimated means that are lower than the reference group’s mean; thus we clearly deal primarily with the 
larger distortions in Δ that result when the focal group’s mean is less than the reference group’s mean. 

[Figure 4 panels: A. Focal Group Mean = -.5; B. Focal Group Mean = 0; C. Focal Group Mean = .5; D. Focal Group Mean = 1] 

FIGURE 4. Variation of Δ with bR (b value of reference group, indicated by the symbol), bF (b value of 
focal group, indicated along the horizontal axis), and the mean of the focal group (which varies across panels) 
for fixed values of c = .2, a = .9, and Ratio = 5. 

The panels in Figure 5 show how varying the ratio of reference- to focal-group size affects Δ at the various 
values of bR and bF in the 3PL case (c = 0.2) for focal group mean = -1 and a = 0.9. Again, the most dramatic 
effect is that Δ decreases as bR increases. As indicated in Figure 5, varying the ratio of reference group size to 
focal group size has a relatively minor effect on Δ. 

[Figure 5 panels: A. Ratio = 1; B. Ratio = 2; C. Ratio = 3; D. Ratio = 4. Horizontal axis: b Value of Focal Group] 

FIGURE 5. Variation of Δ with bR (b value of reference group, indicated by the symbol), bF (b value of focal 
group, indicated along the horizontal axis), and the ratio of the reference group size to the focal group size 
(which varies across panels) for fixed values of c = .2, a = .9, and focal group mean = -1.0. 
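
The sensitivity of Δ to a, the focal group mean, and the group-size ratio can be explored numerically. The sketch below is an illustration under stated assumptions rather than the report's Equation 15, which is not reproduced in this section. It assumes a population analog of the MH common odds ratio with examinees matched directly on θ, weighting each θ by N_R f_R(θ) N_F f_F(θ) / [N_R f_R(θ) + N_F f_F(θ)], and it assumes the 1.7 logistic scaling constant. Function and argument names are hypothetical.

```python
import math

D = 1.7  # assumed logistic scaling constant


def irf(theta, a, b, c):
    """3PL item response function."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mh_delta(a, b_ref, b_foc, c, ratio=5.0, foc_mean=-1.0, foc_sd=1.0,
             ref_mean=0.0, ref_sd=1.0, lo=-8.0, hi=8.0, steps=4000):
    """Assumed population analog of the MH delta, matching on theta."""
    num = 0.0
    den = 0.0
    width = (hi - lo) / steps
    for i in range(steps):
        theta = lo + (i + 0.5) * width  # midpoint rule; width cancels in num/den
        f_r = ratio * normal_pdf(theta, ref_mean, ref_sd)  # relative group size
        f_f = normal_pdf(theta, foc_mean, foc_sd)
        weight = f_r * f_f / (f_r + f_f)
        p_r = irf(theta, a, b_ref, c)
        p_f = irf(theta, a, b_foc, c)
        num += weight * p_r * (1.0 - p_f)  # reference right, focal wrong
        den += weight * p_f * (1.0 - p_r)  # focal right, reference wrong
    return -2.35 * math.log(num / den)


if __name__ == "__main__":
    # Sweep a for bR = 1, bF = 2, c = .2, Ratio = 5, focal mean = -1.
    for a in (0.5, 0.7, 0.9, 1.1, 1.3):
        print(f"a={a:.1f}  2PL rule={4 * a * (1 - 2):+.3f}  "
              f"sketch={mh_delta(a, 1.0, 2.0, 0.2):+.3f}")
    # Sweep the focal mean and the ratio at a = .9.
    for mean in (-1.0, -0.5, 0.0, 0.5, 1.0):
        print(f"focal mean={mean:+.1f}  sketch={mh_delta(0.9, 1.0, 2.0, 0.2, foc_mean=mean):+.3f}")
    for ratio in (1, 2, 3, 4, 5):
        print(f"ratio={ratio}  sketch={mh_delta(0.9, 1.0, 2.0, 0.2, ratio=ratio):+.3f}")
```

The first sweep can be compared with the Δ values quoted in the text for bR = 1 and bF = 2. Because the weighting scheme here is an assumption, exact agreement with the report's calculations is not guaranteed. With c = 0 the ratio of the two integrals is constant in θ, so the sketch reduces to the 2PL rule regardless of the ability distributions.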



Correspondence to Simulated Data Results 



Previous simulation studies by Allen and Donoghue (1996); Donoghue, Holland, and Thayer (1993); and 
Uttaro and Millsap (1994) have shown an unexplained tendency of the MH DIF estimator to decrease with 
difficulty level. The above results indicate that this tendency may be explained, at least in part, by the behavior 
of the MH DIF parameter itself. As an example, Table 2 shows the mean Δ̂ values from Allen and 
Donoghue (1996, their Table 4) alongside our MH DIF parameter values (the Δ’s), where α was 
calculated from Equation 15. Various values of a and bR are shown. In all cases, c = .2; for the DIF items, 
bF = bR + .4. As can be seen in Table 2, our MH DIF parameter values, Δ, reproduce Allen and Donoghue’s DIF 
estimator means (labelled “Mean Δ̂” in the table) quite well. The amount by which the parameter values differ 
from the estimated values can probably be attributed to the known estimation bias of the MH DIF estimator that 
occurs when the reference and focal group populations have a large difference in their proficiency means (as was 
the case for the data simulated in Allen and Donoghue, 1996; this bias is evident in the “No DIF” columns: all of 
the Δ̂ means would be within a standard error² or two of 0 if there were no bias). 
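
The "standard error or two" argument can be checked with simple arithmetic. The sketch below assumes that the standard error of each mean is the reported SD divided by the square root of the 150 replications (the report may define it differently), and it uses the "No DIF" means and SDs for a = 1.5 as transcribed from Table 2.

```python
import math

REPLICATIONS = 150  # number of replications stated in the note to Table 2

# (bR, mean, SD) for the "No DIF" condition at a = 1.5, transcribed from Table 2
NO_DIF_A15 = [
    (-2, -0.54, 0.46),
    (-1, -0.28, 0.23),
    (0, -0.04, 0.21),
    (1, 0.09, 0.21),
    (2, 0.14, 0.24),
]


def z_score(mean, sd, n=REPLICATIONS):
    """Mean expressed in units of its assumed standard error, SD / sqrt(n)."""
    return mean / (sd / math.sqrt(n))


if __name__ == "__main__":
    for b_ref, mean, sd in NO_DIF_A15:
        print(f"bR={b_ref:+d}  mean={mean:+.2f}  z={z_score(mean, sd):+.1f}")
```

Under this assumption, means lying many standard errors from 0 in a condition with no simulated DIF point to estimator bias rather than sampling noise.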

TABLE 2 
Comparison of MH DIF estimates from Allen and Donoghue (1996) with our MH DIF parameter, Δ 

                  No DIF (bF = bR)         DIF (bF = bR + .4)
  a      bR     Mean Δ̂    SD of Δ̂     Mean Δ̂    SD of Δ̂       Δ
  .5     -2      -.02       .22         -.73       .13        -.73
  .5     -1       .00       .20         -.66       .19        -.67
  .5      0       .04       .18         -.52       .19        -.58
  .5      1       .06       .17         -.36       .18        -.44
  .5      2       .08       .20         -.18       .20        -.29
 1.0     -2      -.28       .33        -1.71       .28       -1.49
 1.0     -1      -.17       .24        -1.42       .20       -1.30
 1.0      0      -.02       .19         -.96       .18        -.95
 1.0      1       .07       .19         -.43       .22        -.48
 1.0      2       .14       .19         -.00       .22        -.15
 1.5     -2      -.54       .46        -2.63       .38       -2.21
 1.5     -1      -.28       .23        -2.07       .22       -1.85
 1.5      0      -.04       .21        -1.17       .21       -1.17
 1.5      1       .09       .21         -.31       .22        -.40
 1.5      2       .14       .24          .05       .23        -.06

Note. In the calculation of the mean and standard deviation of Δ̂, Allen and Donoghue (1996) used 150 replications. The “No DIF” 
estimates from Allen and Donoghue indicate the amount of bias present in Δ̂ caused by the large difference in mean proficiency 
between the reference and focal groups. In all cases, c = .2. The reference group’s θ’s were sampled from a normal distribution with a 
mean of 0 and standard deviation of .7. The focal group’s θ’s were sampled from a normal distribution with a mean of -.7 and standard 
deviation of .8. The ratio of reference group size (5,199) to focal group size (1,959) was 4.857. These same θ distributions and ratio were 
specified for the calculation of Δ, given by Δ = -2.35 ln(α), where α is given in Equation 15. 
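
As a worked check on the relation between α and the 2PL rule used throughout this section, the following derivation shows how the definition Δ = -2.35 ln(α) yields Δ ≈ 4a(bR - bF) when there is no guessing. It assumes the 2PL response function is written with the logistic scaling constant 1.7, which the factor of 4 implies but which is not stated in this section, and it takes α as the odds ratio for reference versus focal examinees at a common θ.

```latex
\begin{align*}
P_g(\theta) &= \frac{1}{1 + e^{-1.7a(\theta - b_g)}}, \quad g \in \{R, F\}
  \;\Longrightarrow\;
  \ln\frac{P_g(\theta)}{1 - P_g(\theta)} = 1.7a(\theta - b_g), \\
\ln\alpha &= 1.7a(\theta - b_R) - 1.7a(\theta - b_F) = 1.7a(b_F - b_R), \\
\Delta &= -2.35\ln\alpha = -2.35(1.7)\,a\,(b_F - b_R)
        = 3.995\,a\,(b_R - b_F) \approx 4a(b_R - b_F).
\end{align*}
```

Because θ cancels in the second line, the 2PL value does not depend on the ability distributions or the group-size ratio. With c > 0 the log-odds are no longer linear in θ, so this cancellation fails.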

Discussion 

Before items are ever presented to examinees on an LSAT form, the items undergo an extensive sensitivity 
review process. Despite these precautions, some items may function differently in various subgroups (i.e., 
exhibit DIF). DIF statistics are designed to detect such items. Several DIF statistics have been developed, but 
the MH DIF procedure, which is used at LSAC, has become the most widely used methodology and is 
recognized as the testing industry standard. Although the behavior of the MH DIF estimator’s population 
parameter is known in 1PL and 2PL data, it has not been known in 3PL data because the formulation of the MH 
population parameter, Δ, has been an unsolved problem in the 3PL case. This lack of knowledge about Δ for 
3PL items has limited the evaluation of the statistical bias in Δ̂ and has also hindered the understanding of the 
observed effects of simulation study factors on Δ̂. In particular, several researchers have found that the 
difficulty level of a 3PL DIF item can have a sizable effect on the magnitude of Δ̂ (Allen & Donoghue, 1996; 
Donoghue, Holland, & Thayer, 1993; Uttaro & Millsap, 1994), but none of these studies could adequately 
explain the cause of this effect. 

The present statistical report formulated a population DIF parameter for the MH DIF estimator for any IRF 
model, including the 3PL model, and investigated its behavior with respect to a number of examinee and item 
factors through a systematic set of calculations. The findings presented here indicate that caution should be used 
in applying the MH DIF estimator to item response data that follow the 3PL model. In particular, the results 
indicate that the MH DIF estimator may exhibit reduced statistical power to detect DIF in 3PL items of medium 
or high difficulty, even when DIF is substantial (i.e., large difference in b’s), especially when the focal group 
has a low mean proficiency. Additionally, it was shown that the behavior of the MH DIF population parameter 
can account for the unexplained behavior of Δ̂ with respect to difficulty level observed in a past simulation 
study (Allen & Donoghue, 1996). The fact that Δ̂ is smaller (in absolute value) than expected for 3PL items of 
medium or high difficulty, as compared with 1PL or 2PL items, now can be explained because the value of the 
MH DIF parameter, Δ, also exhibits this pattern. Thus Δ̂ should be used with caution until the apparent 
deficiencies of this procedure are better understood or corrected. 

The implications of this study for the routine operational task of identifying DIF at LSAC are still unknown 
because real data do not mimic simulated data exactly. However, because some items on the LSAT are known 
to exhibit guessing behavior, the results certainly suggest that additional research is warranted. 